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Abstract 

Independent component analysis (ICA) has been shown to be useful in many appli- 
cations. However, most ICA methods are sensitive to data contamination and outliers. 
In this article we introduce a general minimum [/-divergence framework for ICA, which 
covers some standard ICA methods as special cases. Within the [/-family we further 
focus on the 7-divergence due to its desirable property of super robustness, which gives 
the proposed method 7-ICA. Statistical properties and technical conditions for the con- 
sistency of 7-ICA are rigorously studied. In the limiting case, it leads to a necessary 
and sufficient condition for the consistency of MLE-ICA. This necessary and sufficient 
condition is weaker than the condition known in the literature. Since the parameter 
of interest in ICA is an orthogonal matrix, a geometrical algorithm based on gradient 
flows on special orthogonal group is introduced to implement 7-ICA. Furthermore, a 
data-driven selection for the 7 value, which is critical to the achievement of 7-ICA, is 
developed. The performance, especially the robustness, of 7-ICA in comparison with 
standard ICA methods is demonstrated through experimental studies using simulated 
data and image data. 

Key words and phrases: /3-divergence; 7-divergence; geodesic; independent compo- 
nent analysis; minimum divergence; robust statistics; special orthogonal group. 
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1 Introduction 



Consider the following generative model for independent component analysis (ICA) 

X = A3 + fi, (1) 

where the elements of the non-Gaussian source vector S £ W are mutually independent with 
zero mean, A £ W xp is an unknown nonsingular mixing matrix, X £ W is an observable 
random vector (signal), and /x = E(X) £ M. p is a shift parameter. Let Z = S~ 1 / 2 (A — /i) 
be the whitened data of X, where E = cov(X). An equivalent expression of model (1) in 
Z-scale is 

Z = AS, (2) 

where A = E _1//2 A is the mixing matrix in Z-scale. It is reported in literature that prewhiten- 
ing the data can make the ICA inference procedure more stable. In the rest of the discussion, 
we will work with model (2) in estimating the mixing matrix A based on the prewhitened 
Z. It is easy to transform back to the original X-scale via A = E X / 2 A Note that both A 
and S are unknown, and there exists the identifiability problem. This can be seen from the 
fact that Z = AS = (AM)(M~ l S) for any nonsingular diagonal matrix M. To make A 
identifiable, we assume the following conventional conditions for S: 

E(S) = and cov(S) = I p , (3) 

where I p £ W pxp is the identity matrix. It then implies that E = AA T and 

I p = cov(Z) = Acov(S) A T = AA T , (4) 

which means that the mixing matrix A in Z-scale is orthogonal. We will use notation O p 
to denote the space of orthogonal matrices in M pxp . Note that, if A £ O p is a parameter of 
model (2), so is — A £ O p . Thus, to fix one direction, we consider A £ SO p , where SO p C O p 
consists of orthogonal matrices with determinant one. This set SO p is called the special 
orthogonal group. The main purpose of ICA is to estimate the orthogonal A £ SO p based on 
the whitened data {^}" =1 , or equivalently, to look for a recovering matrix W £ SO p so that 
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components in Y =: W T Z = (wj Z, . . . , w p Z) T have the maximum degree of independence. 
In the latter case, W provides an estimate of A. 

We first briefly review some existing methods for ICA. One idea is to estimate W via 
minimizing the mutual information. Let gy be the joint probability density function of 
Y = (Yi, . . . , Y p ) T , and gy. be the marginal probability density function of Yj. The mutual 
information, denoted by /(Yi, . . . , Y p ), among random variables (Yi, . . . , Y p ), is defined to be 

I(Y 1 ,...,Y p ):="£H(Y j )-H(Y), (5) 

3=1 

where H(Y) = — J gy In gy and H(Yj) = — J gy. lngy^ are the Shannon entropy. Ideally, if 
W is properly chosen so that Y has independent components, then gy = ]T . gy. and, hence, 
I(Yi, . . . , Y p ) = 0. Thus, via minimizing /(Yi, . . . , Y p ) with respect to W, it leads to an 
estimate of W. Another method is to estimate W via maximizing the negentropy, which is 
equivalent to minimizing mutual information as described below. The negentropy of Y is 
defined to be 

J(Y) = H(Y') - H(Y), (6) 

where Y' is a Gaussian random vector having the same covariance matrix as Y (Hyvarinen 
and Oja, 2000). It can be deduced that 

I(Y 1: ...,Y P ) = J(Y) - J(Xi) - H(Y') + £ H^J) = J ( Y ) - E J (^)> ( ? ) 

j=i j=i j=i 

where the second equality holds since, by cov(Y') = cov(Y) = I p , H(Y') = Y7j=\H{Y-). 
Moreover, as Y = W T Z with W G SO p , we have J(Y) = J(Z), which does not depend on 
W. That is, the negentropy is invariant under orthogonal transformation. Thus, minimizing 
the mutual information /(Yi, . . . , Y p ) is equivalent to maximizing the negentropy Y^=i J(Yj)- 
The negentropy /(Yj), however, involves the unknown density gy.. To avoid nonparametric 
estimation of gy p one can use the following approximation (Hyvarinen, 1998) via a non- 
quadratic contrast function G(-), 

J{Y 3 ) « J G (Y}) = [E{G{Y-)} - E{G{u)}}\ (8) 
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where v is a random variable having the standard normal distribution. Here Jq can be 
treated as a measure of non-Gaussianity, and minimizing the sample analogue of Jg(Yj) to 
search W corresponds to the fast-ICA (Hyvarinen, 1999). 

Another widely used estimation criterion for W is via maximizing the likelihood. Under 
model (2) and by modeling gy j = fj with some known probability density function fj, the 
density function of Z takes the form 

f z (z;W) = \det(W)\ f[f j (w]z) = f[f j (w]z) (9) 

3=1 i=i 

since W G SO p and hence det(PU) = 1. The MLE-ICA then searches the optimum W via 

axgirinPo WO), (10) 

w&so p 

where T> (g,f) = — f g\n(f/g) is the Kullback-Leibler divergence (KL-divergence), and g n 
is the empirical distribution of {zi}f =1 . Possible choices of fj include fj(s) = c\ exp( — c 2 s 4 ) 
for sub-Gaussian models, and fj(s) = Ci/cosh(c 2 s) for super-Gaussian models, where c\ and 
C2 are constants so that fj is a probability density function. It can be seen from (9) that, for 
any row permutation matrix II, we have fz(z] n W) = fz{z] W). That is, we can estimate 
and identify A only up to its row-permutation. 

As will become clear later that the above mentioned methods are all related to minimiz- 
ing the KL-divergence, which is not robust in the presence of outliers. Outliers, however, 
frequently appear in real data analysis, and a robust ICA inference procedure becomes nec- 
essary. For the purpose of robustness, instead of the KL-divergence, Mihoko and Eguchi 
(2002) considers the minimum (3 -divergence estimation for W (/3-ICA). The issues of consis- 
tency and robustness of /3-ICA are discussed therein. On the other hand, the 7-divergence, 
which can be induced from /?- divergence, is shown to be super robust (Fujisawa and Eguchi, 
2008) against data contamination. It is our aim in this paper to propose a unified ICA 
inference procedure by minimum divergence estimation. Moreover, due to the property of 
super robustness, we will focus on the case of 7-divergence and propose a robust ICA proce- 
dure, called 7-ICA. Hyvarinen, Karhnen and Oja (2001) have provided a sufficient condition 
to ensure the validity of MLE-ICA under the orthogonality constraint of W, in the sense 
of being able to recover all independent components. Amari, Chen, and Cichocki (1997) 

5 



studied necessary and sufficient conditions for consistency under a different constraint of W, 
and this consistency result is further extended by Mihoko and Eguchi (2002) to the case of 
/3-ICA. In this work, we also derive necessary and sufficient conditions for the consistency of 
7-ICA. In the limiting case 7 — > 0, our necessary and sufficient condition for the consistency 
of MLE-ICA is weaker than the condition stated in Hyvarinen, Karhnen and Oja (2001). To 
the best of our knowledge, this result is not explored in existing literature. 

Some notation is defined here for the convenience of reference. For any M G IR pxp , let 
K p G M p2xp2 be the commutation matrix such that vec(M T ) = K p vec(M); M > (resp. < 0) 
means M is strictly positive (resp. negative) definite; and exp(M) := J2h=o TT * s ^ ne ma t r i x 
exponential. Note that det(exp(M)) = exp(tr(M)) for any nonsingular square matrix M. 
For a lower triangular matrix M with diagonals, vecp(M) stacks the nonzero elements of 
the columns of M into a vector with length p(p— l)/2. There exist matrices P G MP^ p ~ 1 ^ 2xp2 
and Q G W p2xp{p ~ 1)/2 such that vecp(M) = Pvec(M) and vec(M) = Qvecp(M). Each column 
vector of Q is of the form (e^ <8> &,-), % < j, where e, G MP is a vector with a one in the z-th 
position and elsewhere, and £§> is the Kronecker product. I p G IR pxp is the identity matrix 
and l p G MP is the p-vector of ones. 

The rest of this paper is organized as follows. A unified framework for ICA estimation 
by minimum divergence is introduced in Section 2. A robust 7-ICA procedure is developed 
in Section 3, wherein the related statistical properties are studied. A geometrical imple- 
mentation algorithm for 7-ICA is further illustrated in Section 4. In Section 5, the issue of 
selecting 7 value is discussed. Numerical studies are conducted in Section 6 to demonstrate 
the robustness of 7-ICA. The paper is ended with a conclusion in Section 7. All the proofs 
are placed in Appendix. 

2 Minimum £/- divergence estimation for ICA 

In this section we introduce a general framework for ICA by means of a minimum U- 
divergence, which covers the existing methods reviewed in Section 1. The aim of ICA is 
to search a matrix W G SO p so that the joint probability density function gy for Y = W T Z 
is as close to marginal product . gy i as possible. This aim then motivates estimating W 
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by minimizing a distance metric between gy and J"J . ^ . A general estimation scheme for 
W can be formulated through the following minimization problem 

j 

where ■) is a divergence function. Different choices of T> will lead to different esti- 
mation criteria for ICA. Here we will consider a general class of divergence functions, the 
[/-divergence (Murata et al., 2004; Eguchi, 2009), as described below. 

The [/-divergence is a very general class of divergence functions. Consider a strictly 
convex function U (t) defined on R, or on some interval of R where U(t) is well-defined. Let 
£ = U" 1 be the inverse function of U := 4:U(t). Consider 



! U({(f)) - U(t(g)) - U(t(g)) ■ {£(/) - £(g)} 
U(t(f))-U{t(g))-g-{Z(f)-Z(a)h (12) 



which defines a mapping from M.u'xM.u to [0, oo), where M.jj — {/ : J C/ (£(/)) < oo, / > 0}. 
Define the [/-cross entropy by 



Cu(g,f) = - J t(f)g + J U(t(f)), (13) 

and the [/-entropy by H u (g) = Cu(g,g). Then the [/-divergence can be written as 

T> u (g,f) = C u (g,f)-H u (g)>0. (14) 

In the subsequent subsections, we will introduce some special cases of [/-divergence, which 
will lead to specific methods of ICA. 

2.1 KL-divergence 

By taking the ([/, £) pair 

U(t) = exp{t), £(t)=lnt, (15) 

the corresponding [/-divergence is equivalent to the KL-divergence D . In this case, it can 
be deduced that 

V (g Y ,l[g Yj ) = I(Y u ...,Y p ), (16) 



where I(Yi, . . . , Y p ) is the mutual information defined in (5). As described in Section 1 that 

p p 

argmin /(Fi, . . . , Y p ) = argmax > J(Yj) ~ argmax > Jq(Yj), (17) 
W£SO p WGSOp ~[ WeSOp ~l 

we conclude that the following criteria, minimum mutual information, maximum negentropy, 
and fast-ICA, are all special cases of (11). On the other hand, observe that 

V (g Y (y),l[g Yj (y j ))=V (g z (z),l[g Yj (wjz)), (18) 

3 j 

where gz is the joint probability density function of Z. If we consider the model g Yj = fj, and 
if we estimate gz by its empirical probability mass function ~g n , minimizing (18) is equivalent 
to MLE-ICA in (10). In summary, choosing the KL-divergence D covers minimum mutual 
information, maximum negentropy, fast-ICA, and MLE-ICA. 

2.2 /3-divergence 

Consider the convex set Aip+i := {/ : J f^ +1 < oo, / > 0}. Take the (U,£) pair 

u{t) = T ^{i+pt) e ?, m = -^-i). (19) 

The resulting [/-divergence defined on Aip+i x JAp+i is calculated to be 

Bfiig, f) = ^Jis p - f)9 -J^lj - f" +l ) (2°) 

which is called /3-divergence (Mihoko and Eguchi, 2002), or density power divergence (Basu 
et al., 1998). Note that Bp(g, f ) = if and only if / = Xg for some A > 0. In the limiting 
case lim^o Bp = V , it gives the KL-divergence. If we replace V Q in (10) by Bp, it gives the 
0-ICA of Mihoko and Eguchi (2002). 

2.3 7-divergence 

The 7-divergence can be obtained from /3-divergence through a [/-volume normalization, 

V y {gJ) :=B y {a{g)-g,a{f)-f), 
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where £> 7 is defined the same way as (20) with the plug-in /3 = 7, and where 01(f) is some 
normalizing constant. Here we adopt the following normalization, called the volume- mass- 
one normalization, 

J U(t(a(f)-f(x)))dx = l. (21) 
It leads to a(f) = (7 + l) 1/(7+1) ||/||;+i. Then, 

It can be seen that 7-divergence is scale invariant. Moreover, T>^(g, /) = if and only if 
f = Xg for some A > 0. The 7-divergence, indexed by a power parameter 7, is a generalization 
of KL-divergence. In the limiting case lim 7 _ > o^7 = T^o, it gives the KL-divergence. It is 
well known that MLE (based on minimum KL-divergence) is not robust to outliers. On the 
other hand, the minimum 7-divergence estimation is shown to be super robust (Fujisawa and 
Eguchi, 2008) against data contamination. Hence, we will adopt 7-divergence to propose our 
robust 7-ICA procedure. In particular, the main idea of 7-ICA is to replace V in (10) by 
T> 1 . Though the idea is straightforward, there are many issues need to be studied. Detailed 
inference procedure and statistical properties of 7-ICA are discussed in Section 3. 



3 The 7-ICA inference procedure 

The ICA is actually a two-stage process. First, we need to whiten the data. The whitened 
data are then used for the recovery of independent sources. Since the main purpose of 
this study is to develop a robust ICA inference procedure, the robustness for both data 
prewhitening and independent source recovery should be guaranteed. Here we will utilize 
the 7-divergence to introduce a robust prewhitening method called 7-prewhitening, followed 
by illustrating 7-ICA based on the prewhitened data. In practice, the 7 value for 7-divergence 
should also be determined. In the rest of discussion, we will assume 7 is given, and leave the 
discussion of its selection to Section 5. 
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3.1 7"P rew hitening 

Although prewhitening is always possible by a straightforward standardization of X, there 
exists the issue of robustness of such a whitening procedure. It is well known that empirical 
moment estimates of (//, E) are very sensitive to outliers. In Mollah, Eguchi and Minami 
(2007), the authors proposed a robust /3-prewhitening procedure. In particular, let be 
the probability density function of p-variate normal distribution with mean [i and covariance 
E, and let g^x{x) be the empirical distribution based on data {xi}f =1 . With a given {3, Mollah 
et al. (2007) proposed the following minimum /3-divergence estimators 

(k,/2, E) = a,TgmmBp(gx, k ■ £ M , S ), (23) 

and then suggested to use (ju, E) for whitening the data. Interestingly, (ju, E) can also be 
derived from the minimum 7-divergence as 

(/2, E) = argminP 7 (^ x , £ MjS ). (24) 
At the stationarity of (24), the solutions (/x, E) will satisfy 

?= Eki!^b and % = {1+ y Tr^fcx)-(xi-fi)(xi-fi) T (25) 



where 



:) = exp <J —^{Xi - /i) T S 1 (x i - fj) 



The robustness property of (ju, E) can be found in Mollah et al. (2007). We call the 
prewhitening procedure 

Zi = E~ 1/2 (xi - ju), i = l,...,n (26) 
the 7-prewhitening. The whitened data {^}" =1 then enter the 7-ICA estimation procedure. 

3.2 Estimation of 7-ICA 

We are now in the position to develop our 7-ICA based on the 7-prewhitened data {zi}f =l . 
As discussed in Section 2.3, the W estimator is derived from 

W = argnmV^ n J z {- ] W)) (27) 

W&SOp 
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where fz{z; W) = THj=i fj( w J z ) an d fj is the working model for gy j . Since W G SO p , 
r z + \z; W)dz = |det(W)l n / fj +1 (v^ = flf /;•'(•'/, )^/,- 



which does not involve W. Thus, W can be equivalently obtained via 

— 1 n \ P } 

W = argmax C(W) : = argmax — } < IT fj(wjzi) > 
weso P wgso p n j-f U = x j 



(28) 



Finally, the mixing matrix A is estimated by A = T}/ 2 W . Let /(jy T z) = Y]^=i fj( w J z ) anc ^ 

0(iy T z) := [0 1 (wJz),...,0 p (wJz)] T , where ^ (j/) = — In (y) . 
We have the following proposition. 

Proposition 1. At the stationarity, the maximizer W defined in (28) will satisfy 

1 71 f-~- r 1 T - r^. nTl 

-^/t^t^ ^ 0(^ 2i) _0(^t^ ^t^ > = 0. (29) 
n i= i L L J J 

From Proposition 1, it can be easily seen the robustness nature of 7-ICA: the stationary 

equation is a weighted sum with the weight function f 1 . When 7 > 0, an outlier with 

extreme value will contribute less to the stationary equation. In the limiting case of 7 — > 0, 

which corresponds to MLE-ICA, the weight f 1 becomes uniform and, hence, is not robust. 



3.3 Consistency of 7-ICA 

A critical point to the likelihood-based ICA method is to specify a working model fj for gy j . 
A sufficient condition to ensure the consistency of MLE-ICA can be found in Hyvarinen, 
Karhnen and Oja (2001). Here the ICA consistency means recovery consistency. That is, an 
ICA procedure is said to be recovery consistent if it is able to recover all the independent 
components. Note that the consistency of MLE-ICA does not rely on the correct specification 
of working model fj, but only on the positivity of E[(j)j(Sj)Sj — 0'- (<%)], j = 1, . . . ,p. This 
subsection aims to investigate the consistency of 7-ICA for 7 G T = (0, r], where r > 
is some constant. We will deduce necessary and sufficient conditions such that 7-ICA is 
recovery consistent. The main result is summarized below. 

11 



Theorem 1. Assume the ICA model (2). Assume the existence of t for V = (0,r] such that 

(A) ElfJiS^Sj] = for all^eY and all j = l,...,p. 

Then, for 7 G T, the associated j-ICA is recovery consistent if and only if 

* 7 = Q T (Ip* - K p ) { 7 tf (1) + 7 2 * (2) } (J p2 - K p ) Q < 0, (30) 

where = £J =1 (e ie T ® E/,-) - (£> <8> /„), ^ = diag( Mjl , . . .,u jp ), u jk = Elf^S^Sj)^], 
D = diagfa, . . . , d p ), d,- = EiPiS^iS^}, and ¥ (2) = (5){0(,S)0 T (^) ® ,S5 T }] . 

Condition (A) of Theorem 1 can be treated as a weighted version of E(Sj) = 0. It is 
satisfied when Sj is symmetrically distributed about zero, and when the model probability 
density function fj is an even function. We believe condition (A) is not restrictive and 
should be approximately valid in practice. Notice that ^( 2 ) > 0. Thus, for the validity 
of (30), we must require that 7^(1) < 0, and the effect of 7 2 \I/(2) > can be exceeded 
by 7^(1) < 0. Fortunately, due to the coefficient 7 2 , when 7 is small, the effect of 7^(1) 
will eventually outnumber the effect of 7 2 \I/(2), so that \& 7 < can be ensured. In this 
situation, the negative definiteness of \l/ 7 mainly relies on the structure of ^(1). Moreover, a 
direct calculation gives Q T (I p 2 — Kp)^ M)(ip2 — K P )Q to be a diagonal matrix with diagonal 
elements {(v,j k — dj) + (u k j — d k ) ■ j < k}. We thus have the following corollary. 

Corollary 2. Assume the ICA model (2). Assume the existence of a small enough r for 
T = (0, r] such that 

(A) EtfiSdSj] = for 7 G r, j = 1, . . . ,p. 

(B) E\r{S){^{Si)Si ~ ^{S 3 )Sl}} + E[p(S){MSk)S k - <f>' k (S k )S*}] > for 7 G I\ for 
all pairs (j, k), j ^ k. 

Then, for every 7 G T, the associated '-/-ICA can recover all independent components. 

To understand the meaning of condition (B), we first consider an implication of Corol- 
lary 2 in the limiting case of 7 — > 0, which corresponds to the MLE-ICA. In this case, 
condition (A) becomes E(Sj) = 0, which is automatically true by the model assumption of 
S. Moreover, since E(Sj) = 1, condition (B) becomes 

l-- o J (S J )S J - ^.(S,)] + E[MSk)S k - <f)' k (S k )} > 0, for all pairs (j, k), j ± k. (31) 
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A sufficient condition to ensure the validity of (31) is 

EfoiSJSj - > 0, Vj, (32) 

which is the same condition given in Theorem 9.1 of Hyvarinen, Karhnen and Oja (2001) 
for the consistency of MLE-ICA. We should note that (31) is a weaker condition than (32). 
In fact, from the proof of Theorem 1, (31) is also a necessary condition. One implication of 
(31) is that, we can have at most one fj to be wrongly specified or at most one Gaussian 
component involved, and MLE-ICA is still able to recover all independent components. This 
can also be intuitively understood that once we have determined p — 1 directions in MP, the 
last direction is automatically determined. However, this fact cannot be observed from (32) 
which requires all fj to be correctly specified. We summarize the result for MLE-ICA below. 

Corollary 3. Assume the ICA model (2). Then, MLE-ICA is recovery consistent if and 
only if E[<t>j(Sj)Sj - <f>'j(Sj)] + E[(f) k (S k )S k - 4>' k (S k )] > for all pairs (j, k), j ± k. 

Turning to the case of 7-ICA, condition (B) of Corollary 2 can be treated as a weighted 
version of (31) with the weight function f 1 . However, one should notice that the validity of 
7-ICA has nothing to do with that of MLE-ICA, since there is no direct relationship between 
condition (B) and its limiting case (31). In particular, even if (31) is violated (i.e., MLE-ICA 
fails), with a proper choice of 7, it is still possible that condition (B) holds and, hence, the 
recovery consistency of 7-ICA can be guaranteed. 

Remark 4. By Theorem 1, a valid j-ICA procedure must correspond to ^ 7 < 0, or equiva- 
lent^, the maximum eigenvalue of i S 1 , denoted by A max (\I/ 7 ) ; must be negative. How should 
one pick a V -interval so that 7 £ V is legitimate in the sense that A max (\E' 7 ) < 0? Our sug- 
gestion for a rule of thumb is as follows. Let \& 7 be the empirical estimator of \I/ 7 based on 
the estimated source {s"i}" = i, where Sj := W T Zi. The plot of {(7, A max (^ r 7 ))} then provides 
a guidance in determining T, over which A max (^ r 7 ) should be far away below zero. With the 
Y -interval, a further selection procedure, introduced in Section 5, can be applied to select an 
optimal 7 value from V . It is confirmed in our numerical study in Section 6 that, the interval 
V , where A ma x(^I/ 7 ) < ; is quite wide, and the suggested rule does provide adequate choice of 
T. It also implies that the choice of t in Corollary 2 is not critical, as r is allowed to vary 
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in a wide range and not limited to very small number. It is the condition (B) that plays the 
most important role to ensure the recovery consistency ofj-ICA. 



3.4 /3-ICA versus 7-ICA 

By using /3-divergence, Mihoko and Eguchi (2002) proposed /3-ICA to recover independent 
components. The objective function of 0-1GA being maximized is of the form 

\det(W)f I J f(W T z)g z (z)dz - C/3 | , (33) 

where cp is a known constant. If we restrict W G SO p , then |det(W)| = 1 and maximizing 
(33) is equivalent to maximizing J f@(W T z)gz{z)dx, which has the same form with the 
population objective function of 7-ICA in (28). We should emphasize that Mihoko and 
Eguchi (2002) considered the ICA problem under the original AT-scale, while the constraint 
W G SO p is a consequence of prewhitening. Without considering the constraint W G SO p , 
the objective function of 7-ICA is deduced to be 

|det(W)|^|y P(W T z)g z (z)dz^ (34) 

which is different from (33). However, (33) is similar to (34) when cp is small. This fact also 
confirms the observation of Mihoko and Eguchi (2002) that setting cp = does not affect 
the performance of /3-ICA. In summary, 7-ICA and /3-ICA based on the whitened data Z 
are equivalent. For data X in original scale, however, 7-ICA maximizing (34) is different 
from /3-ICA maximizing (33), but they will have similar performance for small /3. 



4 Gradient method for 7-ICA on SO p 

In this section, we introduce an algorithm for estimating W constrained to the special 

orthogonal group SO p , which is a Lie group and is endowed with a manifold structure. 1 

The Lie group SO p , which is a path-connected subgroup of O p , consists of all orthogonal 

X Q is a Lie group if the group operations Q x Q — > Q defined by (x, y) — > xy and Q — > Q defined by 
x — > x^ 1 are both C°° mappings (Boothby, 1986). 
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matrices in MP xp with determinant one. 2 Recall £ being the objective function of 7-ICA 
maximization problem defined in (28). A desirable algorithm is to generate an increasing 
sequence {C(Wk)}^ =1 with Wk G SO p , such that {Wfc}^ converges to a local maximizer 
W* of £. Various approaches can be used to generate such a sequence {W^}^ in SO p , 
for instance, geodesic flows and quasi-geodesic flows (Nishimori and Akaho, 2005). Here we 
focus on geodesic flows on SO p . In particular, starting with the current Wk, the update 
Wk+i is selected from one geodesic path of Wk along the steepest ascent direction such that 
C(Wk+i) > C(Wk)- In fact, this approach has been applied to the general Stiefel manifold 
(Nishimori and Akaho, 2005). Below we briefly review the idea and then introduce our im- 
plementation algorithm for 7-ICA. We note that the proposed algorithm is also applicable 
to MLE-ICA by changing the corresponding objective function. 

Let Ty/SOp denote the tangent space of SO p at W . Consider a smooth path W(t) on 
SO p with W(0) = W. Differentiating W(t) T W(t) = I p yields the tangent space at W 

T w SOp = {WV :V e R pxp , V T = -V) . (35) 

Clearly, Tj p SO p is the set of all skew-symmetric matrices. Each geodesic path starting from 
I p has an intimate relation with the matrix exponential function. In fact, exp(^) G SO p 
if and only if V is skew-symmetric (see page 148 in Boothby, 1986; Proposition 9.2.5. in 
Marsden and Ratiu, 1998). Moreover, for any M G SO p , there exists (not unique) a skew- 
symmetric V such that M = exp(^). If the Killing metric (Nishimori and Akaho, 2005) 

g w (Y 1 , Y 2 ) := tT(Y?Y 2 ), where Y 1} Y 2 G T w SO p , 

is used, the geodesic path starting from I p in the direction V is given by 

mv,t) : t G R} with $(V,t):=exp(tV). (36) 

Since the Lie group is homogeneous, we can compute the gradient and geodesic at 

Wk G SOp by pulling them back to the identity I p and then transform back to Wk- In 

2 The reason why we consider SO p is that O p itself is not connected. In the case that the desired orthogonal 
matrix W has determinant — 1, our algorithm in fact searches for HW € SO p for some permutation matrix 
II with det(II) = -1. 
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the implementation algorithm, to ensure all the iterations lying on the manifold SO p , we 
update Wk+i through 

W k+l :=W k exp{t k V k ), (37) 

where the skew-symmetric matrix V k and the step size t k are chosen properly to meet the 
ascending condition C(W k +i) > C{W k ). Since, from (36), exp(t k V k ) lies on the geodesic 
path of I p , then W k+ \ = W k exp(t k V k ) must lie on the geodesic path of W k . Moreover, since 
det(iy fc+ i) = det(W k ) exp(O) = 1 by ti(V k ) = 0, the sequence in (37) satisfies W k E SO p for 
all k. The determination of the gradient direction V k and the step size t k is discussed below. 
To compute the gradient and geodesic at W k by pulling them back to I p , define 

JF Wk {W) :=C(W k W). (38) 

We then determine W k+ \ = W k exp(t k V k ) from one geodesic at I p in the direction of the 
projected gradient of J-~w k - Specifically, to ensure the ascending condition, we choose each 
skew-symmetric V k to be V//J-V fe ; the projected gradient of Fw k at I p , defined to be 



Vf/J'wk ■= argmin ||VJW fc - V\\, where VJV fc . 

v&T Ip so p ow 



w=i v 



n 
i=l 

This particular choice of V k ensures the existence of the step size t k for the ascending condi- 
tion. Note that in the case of SO p imposed with the Killing metric, the projected gradient 
coincides with the natural gradient introduced by Amari (1998). See also Fact 5 in Nishimori 
and Akaho (2005) for further details. As to the selection of the step size t k at each iteration 
k with W k and V k = Vf/F-Wk^ we propose to select t k such that W k exp(t k V k ) is the "first 
improved rotation". In particular, we consider t k = ap lk for some a > and < p < 1, 
where i k is the nonnegative integer. To proceed, we search i k such that 

£(W k exp(ap e *V k ))>£(W k ), where V k = V // JF Wh , 

and then update W k +i = W k exp(ap £h V k ). In our implementation, a = 1 and p = 0.5 are 
used. For the convergence issue, one can instead consider the Armijo rule for t k (given in 
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equation (40)). Our experiments show that the above "first improved rotation" rule works 
quite well. Lastly, in the implementation, to save the storage for Wk, we "rotate Z di- 
rectly" instead of manipulating W, where Z is the p x n data matrix whose columns are Zj, 
i — 1, . . . , n. That is, we use the update Z^ = W k T Z. To retrieve the matrix W, we simply 
do a matrix right division of the final Z and the initial Z. The algorithm for 7-ICA based 
on gradient ascend on SO v is summarized below. 

1. Initialization: a = 1, p = 0.5, prewhitened data Z\ = Z (p x n matrix). 

2. For each iteration k — 1, 2, 3, . . ., 

(i) Compute the skew-symmetric matrix in (39). 

(ii) For £k = 0, 1, 2, . . ., if J r v^ fc (exp(ap^ fe Vfc)) > Tw^Ip), then break the loop. 

(iii) Update Z^+i by exp(ap £k Vk) T Zk- Check the convergence criterion. If the criterion 
is not met, go back to (i). 

3. Output W = (ZxZjy 1 Z X ZJ. 



Finally, we would like to mention the convergence issue. The statement is similar to 
Proposition 1.2.1 of Bertsekas (2003). 

Theorem 5. Let £ be continuously differentiable on SO p , and J 7 be defined in (38). Let 
{Wk G SO p } be a sequence generated by Wk+i = Wk exp(t k Vk), where Vk is a projected gra- 
dient related (see (41 ) below) and tk is a properly chosen step size by the Armijo rule: reduce 
the step size tk = ap £k , i\. = 0, 1, 2, . . . , until the inequality holds for the first nonnegative £k, 

C{W k+ i) - C(W k ) = J=V fc (exp(t fc V fc )) - JFwM > r}t k tr (V//F^ k V k ) , (40) 

where < rj < 1 is a fixed constant. Then, every limit point W* of {Wk € SO p } is a 
stationary point, i.e., tr(VJ-^*V) = for all V G Ty/*SO p , or equivalently, Vf/J^w* — 0. 

The statement that Vk is a projected gradient related corresponds to the condition 

limsup tr(V//J-T fc V fc ) >0. (41) 
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This condition is true when Vk is the projected gradient V// J-"w k itself or some natural gradient 
M~ 1 V//J r w k (Theorem 1, Amari, 1998), where M is a Riemannian metric tensor, which is 
positive definite. 

5 Selection of 7 

The estimation process of 7-ICA consists of two steps: 7-prewhitening and the geometry- 
based estimation for W, in which the values of 7 are essential to have robust estimators. 
Hence, we carefully select the value of 7 based on the adaptive selection procedures proposed 
by Minami and Eguchi (2003) and Mollah et al. (2007). We first introduce a general idea 
and then apply the idea to the selection of 7 in both 7-prewhitening and 7-ICA. Define the 
measurement of generalization performance as 



where g is the underlying true joint probability density function of the data, fg is the con- 
sidered model for fitting, 9 1 := argmin^ T> 7 (g, fg) is the minimum 7-divergence estimator 
of 9, and ~g is the empirical estimate of g. The 70 is called the anchor parameter and is 
fixed at 70 = 1 throughout this paper. This value is empirically shown to be insensitive to 
the resultant estimators (Minami and Eguchi, 2003). Let C 7o (7) be the sample analogue of 
C-yoCt)- We propose to select the value of 7 over a predefined set T through 



For 7-prewhitening, g = gx and fg = £^,£ with 6 = (/i, £). For 7-ICA, g = g z and 
fe = fz(-,W) withO = W. 



The above selection criterion requires the estimation of C 70 (7). To avoid the problem of 
overfitting, we apply a fT-fold cross-validation. Let T be the whole data, and let K parti- 
tions of T be 7i, • • ■ , 7k, that is, T% f] Tj = if i 7^ j and T = Uf =1 Ti. The whole selection 
procedure is summarized below. 



C 10 { 1 ) = E[V 10 {gJ % )l 



(42) 



7 = argminC 70 (7). 

7gr 



(43) 



1. For k = 1, . . . , 
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(i) For every 7 G T, obtain 9\ := argmirig C 7 (<^ k \ fo), where ^ is the empirical 
estimate of g based on T\Tk- 

(ii) Compute the cross validation estimate C l0 (jf k \ f-^-k)), where is the empirical 
estimate of g based on %. 

2. Estimate C 70 (7) by 



and obtain 7 = argmin 7gr C 70 ( , y). 



Eventually, we have two optimal values of 7: 7 Mi s for 7-prewhitening and 7V for estima- 
tion of the recovering matrix W. 

6 Numerical experiments 

We conduct two numerical studies to demonstrate the robustness of the 7-ICA procedure. 
In the first study, the data is generated from independent sources with some known distri- 
butions. In the second study, we use transformations of Lena images to form mixed image. 

6.1 Simulated data 

We independently generate the two sources Sj, j = 1,2, from a non-Gaussian distribution 
with sample size n = 150 + n\. The observable X is then given by X = AS, where 



Among the n observations, we add to each of the last n\ observations a random noise e. The 
data thus contains 150 uncontaminated i.i.d. observations from the ICA model, X = AS, 
and Hi contaminated i.i.d. observations from X = AS + e, where e ~ N(fi, a 2 ^) with 
fi = (5, 5) and a = 5. We consider two situations for the independent source S — ('S'l; 5 , 2)- 

(i) UNIFORM SOURCE: Each Sj, j = 1,2, is generated from Uniform(-3, 3). 




(44) 



k=l 



1 2 



.4 



1 0.5 
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(ii) STUDENT-t SOURCE: Each Sj, j = 1, 2, is generated from t-distribution with 3 degrees 
of freedom. 

For the case of uniform source, we use the sub-Gaussian model fj(s) oc exp(— as 4 ) with 
c = 0.1, which ensures the variance under fj is close to unity. As to the case of t source, 
the super-Gaussian model fj(s) oc l/cosh(cs) is considered, and we follow the suggested 
range of Hyvarinen and Oja (2000) and set c = 1.5. We also implement MLE-ICA (using 
the geometrical algorithm introduced in Section 4) and fast-ICA (using the code available 
at http: //www. cis .hut . f i/projects/ ica/f astica/) based on the 7-prewhitened data for 
fair comparisons. To evaluate the performance of each method, we modify from the perfor- 
mance index of Parmar and Unhelkar (2009) by a rescaling and by replacing the 2-norm 
with 1-norm and define the following performance index 

n= -i Wf EfcM , -lW EfcM , -lH<l, (45) 
2p(p - 1) *Y I \maxj \irij\ J V max j Fji| / J 

where tt^ is the (i,j)-th element of II = AW T . We will expect II to be a permutation 
matrix, when the method performs well. In that situation, the value of ir should be very 
close to 0, and attains if IT is indeed a permutation matrix. Simulation results with 100 
replications are reported in Figure 1. 

For the case of no outliers (n x = 0), all three methods perform well except the performance 
index n of 7-ICA increases as 7 increases. This is reasonable since, according to Theorem 1, 
7-ICA may fail to apply when 7 is too large. However, this influence is not severe as the 
the performance index 7r is slightly increased only. As to the case of involving outliers 
{rii = 30), it can be seen that the proposed 7-prewhitening followed by 7-ICA does possess 
the advantage of robustness for a wide range of 7 values, while the other two methods are not 
able to recover the latent sources. The performance of 7-ICA becomes worse when 7 is small, 
since in the limiting case 7—^0, 7-ICA reduces to the non-robust MLE-ICA. We note that 
both 7-prewhitening and 7-ICA are critical. This can be seen from the poor performance 
of MLE-ICA and fast-ICA, even they use the 7-prewhitened data as the input. Indeed, 
7-prewhitening only ensures that we shift and rotate the data in a robust manner, while 
the outliers will still enter into the subsequent estimation process and, hence, a non-robust 
result is expected. In Figure 2 we report the scatter plots of the recovered sources A~ l X 
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from each method, of X, and of A~ 1 X for one simulation run (ni = 30). These plots still 
convey the same message that 7-ICA is the winner among three methods, where the pattern 
of the reconstructed sources from 7-ICA is the most close to that of A~ l X. 

6.2 Lena image 

We use the Lena picture to evaluate the performance of 7-ICA. In our experiment, we use the 
Lena image with 512 x 512 pixels. We construct four types of Lena as the latent independent 
sources S as shown in Figure 5. We randomly generate the mixing matrix to be A = I4I4 +C, 
where the elements of C G M 4x4 are independently generated from Uniform(— 0.3, 0.3). The 
observed mixed pictures are also placed in Figure 5, wherein about 30% of the pixels are 
added with random noise generated from iV(20, 50 2 ) for contamination. The aim of this data 
analysis is to recover the original Lena pictures based on the observed contaminated mixed 
pictures. In this analysis, the pixels are treated as the random sample, each with dimension 
4. We randomly select 1000 pixels to estimate the demixing matrix, and then apply it to 
reconstruct the whole source pictures. We conduct two scenarios to evaluate the robustness 
of each method: 

1. Using the original image X as the input (see the second row of Figure 5). 

2. Using the filtered image X* from X as the input (see the third row of Figure 5). 

The filtering process in the second scenario can be treated as a pre-processing to alleviate 
the influence of additive Gaussian noise. In both scenarios, the estimated demixing matrix 
is applied to the original images X to recover 5". Note that with Gaussian noise contamina- 
tion, conventional prewhitening by empirical moment estimators is not robust and, hence, 
both fast-ICA and MLE-ICA may fail to apply. Therefore, we prewhiten the data by 7- 
prewhitening first and then apply 7-ICA, MLE-ICA, and fast-ICA to the same whitened 
data for fair comparison. The plot {(7, A max (\I / 7 ))} introduced in the end of Section 3.3 is 
placed in Figure 3, which suggests that T = (0, 1] is a good candidate for possible 7 values. 
We then apply the cross-validation method developed in Section 5 to determine the opti- 
mal 7 G T. The estimated values of C 70 (7) are plotted in Figure 4, from which we select 
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7m,s = 0.2 for 7-prewhitening and 7V = 0.15 for 7-ICA. The recovered pictures are placed 
in Figures 6-8, where for each figure the first row is for Scenario- 1 and the second row is for 
Scenario-2. 

It can be seen that 7-ICA is the best performer under both scenarios, while MLE-ICA 
and fast-ICA cannot recover the source images well when data is contaminated. It also 
demonstrates the applicability of the proposed 7-selection procedure. We detect that MLE- 
ICA and fast-ICA perform better when using filtered images X*, but can still not reconstruct 
images as good as 7-ICA does. Interestingly, 7-ICA has a reverse performance, where the 
best reconstructed images are from the original images instead of the filtered ones. The 
filtering process, which aims to achieve robustness, replaces the original pixel value by the 
median of the pixel values over its neighborhood. Therefore, while filtering process will 
alleviate the influence of outlier, it is also possible to lose useful information at the same 
time. For instance, a pixel without being contaminated will still be replaced by certain 
median value during the filtering process. 7-ICA, however, works on the original data X 
that possesses all the information available, and then weights each pixel according to its 
observed value to achieve robustness. Hence, a better performance for 7-ICA based on the 
original images is reasonably expected. 

7 Conclusions 

In this paper, we introduce a unified estimation framework by means of minimum U- 
divergence. For the reason of robustness consideration, we further focus on the specific choice 
of 7-divergence, which gives the proposed 7-ICA inference procedure. Statistical properties 
are rigorously investigated. A geometrical algorithm based on gradient flows on orthogonal 
group is introduced to implement our 7-ICA. The performance of 7-ICA is evaluated through 
synthetic and real data examples. 

There are still many important issues that are not covered by this work. For example, 
we only consider full ICA problem, i.e., simultaneous extraction of allp independent compo- 
nents, which is unpractical in the case of large p. It is of interest to extend our current 7-ICA 
to partial 7-ICA. Another issue of interest is also related to the large-p-small-n scenario. In 
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this work, data have to be prewhitened before entering the 7-ICA procedure. Prewhitening 
can be very unstable especially when p is large. How to avoid such a difficulty is an inter- 
esting and challenging issue. Tensor data analysis is now becoming popular and attracts the 
attention of many researchers. Many statistical methods include ICA have been extended 
to deal with such a data structure by means of multilinear algebra techniques. An extension 
of 7-ICA to a multilinear setting to cover tensor data analysis is also of great interest for 
future study. 

References 

Amari, S., Chen, T. and Cichocki, A. (1997). Stability analysis of learning algorithms for 
blind source separation. Neural Networks, 10, 1345-1351. 

Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10, 
251-276. 

Basu, A., Harris, I. R., Hjort, N. L. and Jones, M. C. (1998). Robust and efficient estimation 
by minimizing a density power divergence. Biometrika 85, 549-559. 

Bertsekas, D. P. (2003). Nonlinear Programming. Athena Scientific, Belmont, Massachusetts. 

Boothby, W. M. (1986). An Introduction to Differentiable Manifolds and Riemannian Ge- 
ometry. Academic Press. 

Edelman, A., Aris, T. A. and Smith, S. (1998). The geometry of algorithm with orthogo- 
nality constraints SIAM J. Matrix Anal. Appl. 20, 303-353. 

Eguchi, S. (2009). Information divergence geometry and the application to statistical ma- 
chine learning. In Information Theory and Statistical Learning, F. Emmert-Streib and 
M. Dehmer (eds.), 309-332. Springer, Berlin. 

Fiori, S. (2005). Quasi-geodesic neural learning algorithm over the orthogonal group: a 
tutorial. J. Machine Learning Research, 6, 743-781. 

Fujisawa, H. and Eguchi, S. (2008). Robust parameter estimation with a small bias against 
heavy contamination. Journal of Multivariate Analysis, 99, 2053-2081. 

23 



Horn, R. A. and Johnson, C. R. (1991). Topics in Matrix Analysis. Cambridge University 
Press, Cambridge; New York. 

Hyvarinen, A. (1998). New approximations of differential entropy for independent com- 
ponent analysis and projection pursuit. Advances in Neural Information Processing 
Systems, 10, 273-279. 

Hyvarinen, A. (1999). Fast and robust fixed-point algorithms for independent component 
analysis. IEEE Transactions on Neural Networks, 10, 626-634. 

Hyvarinen, A. and Oja, E. (2000). Independent component analysis: algorithm and appli- 
cations. Neural Networks, 13, 411-430. 

Hyvarinen, A., Karhnen, J. and Oja, E. (2001). Independent Component Analysis. Wiley 
Inter-Science. 

Magnus, J. R. and Neudecker, H. (1979). The commutation matrix: some properties and 
applications. Annals of Statistics, 7, 381-394. 

Marsden, J. E. and Ratiu, S. T. (1998). Introduction to Mechanics and Symmetry: A Basic 
Exposition of Classical Mechanical Systems. Springer. 

Mihoko, M. and Eguchi, S. (2002). Robust blind source separation by /3-divergence. Neural 
Computation, 14, 1859-1886. 

Minami, M. and Eguchi, S. (2003). Adaptive selection for minimum /3-divergence method. 
Proceedings of ICA-2003 Conference, Nara, Japan. 

Mollah, M. N. H., Eguchi, S. and Minami, M. (2007). Robust prewhitening for ICA by 
minimizing /3-divergence and its application to fastlCA. J. Neural Processing Letters, 
25, 91-110. 

Murata, N., Takenouchi, T., Kanamori, T. and Eguchi, S. (2004). Information geometry of 
U-boost and Bregman divergence. Neural Computation, 16, 1437-1481. 

Nishimori, Y. and Akaho, S. (2005). Learning algorithms utilizing quasi-geodesic flows on 
the Stiefel manifold. Neurocomputing, 67, 106-135. 

24 



Parmar, S. D. and Unhelkar, B. (2009). Performance analysis of ICA algorithms against 
multiple- sources interference in biomedical systems. International Journal of Recent 
Trends in Engineering, 2, 19-21. 



25 



(a) Uniform Source (n = 0) 



(c) f Source (n = 0) 
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(b) Uniform Source (n = 30) 
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Figure 1: The averages of the performance index it for different methods. 
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(a) Signal X 



(b) Source S 



c) y-ICA 



(d) MLE-ICA 



(e) fast-ICA 
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Figure 2: The scatter plots for the recovered independent components from different meth- 
ods, for the observed signals X, and for the true sources S. In each plot, the red dots are 
observations without contamination, and the blue pluses are contaminated ones, (a)-(e): 
Uniform source (Scenario-1), (f)-(j): t source (Scenario-2). 
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Figure 3: The maximum eigenvalue of \& 7 in (30) at different 7 values for the Lena data 
analysis. 
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(a) y-prewhitening (b) y-ICA 




Figure 4: The cross-validation estimates C 10 {j) with 70 = 1 for (a) 7-prewhitening and (b) 
7-ICA for the Lena data analysis. The red dot indicates the place where the minimum value 
is attained. 
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original image original image original image original image 




mixed image mixed image mixed image mixed image 




filtered image filtered image filtered image filtered image 




Figure 5: Four images of Lena (the first row), the mixed images with 30% pixels being 
contaminated (the second row), and the filtered images from the mixed images (the third 
row). 
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y-ICA y-ICA y-ICA y-ICA 




y-ICA (filtered) y-ICA (filtered) y-ICA (filtered) y-ICA (filtered) 




Figure 6: Recovered Lena images from 7-ICA based on the mixed images (the first row) and 
the filtered images (the second row). 
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MLE-ICA MLE-ICA MLE-ICA MLE-ICA 




MLE-ICA (filtered) MLE-ICA (filtered) MLE-ICA (filtered) MLE-ICA (filtered) 




Figure 7: Recovered Lena images from MLE-ICA based on the mixed images (the first row) 
and the filtered images (the second row). 
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fast-ICA fast-ICA fast-ICA fast-ICA 




fast-ICA (filtered) fast-ICA (filtered) fast-ICA (filtered) fast-ICA (filtered) 




Figure 8: Recovered Lena images from fast-ICA based on the mixed images (the first row) 
and the filtered images (the second row). 
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